What Is Claude? Anthropic Doesn't Know, Either

The New Yorker

Researchers at the company are trying to understand their A.I. system's mind--examining its neurons, running it through psychology experiments, and putting it on the therapy couch. It has become increasingly clear that Claude's selfhood, much like our own, is a matter of both neurons and narratives. A large language model is nothing more than a monumental pile of small numbers. It converts words into numbers, runs those numbers through a numerical pinball game, and turns the resulting numbers back into words. Similar piles are part of the furniture of everyday life. Meteorologists use them to predict the weather. Epidemiologists use them to predict the paths of diseases. Among regular people, they do not usually inspire intense feelings. But when these A.I. systems began to predict the path of a sentence--that is, to talk--the reaction was widespread delirium. As a cognitive scientist wrote recently, "For hurricanes or pandemics, this is as rigorous as science gets; for sequences of words, everyone seems to lose their mind." It's hard to blame them. Language is, or rather was, our special thing. We weren't prepared for the arrival of talking machines. Ellie Pavlick, a computer scientist at Brown, has drawn up a taxonomy of our most common responses. There are the "fanboys," who man the hype wires. They believe that large language models are intelligent, maybe even conscious, and prophesy that, before long, they will become superintelligent. The venture capitalist Marc Andreessen has described A.I. as "our alchemy, our Philosopher's Stone--we are literally making sand think." The fanboys' deflationary counterparts are the "curmudgeons," who claim that there's no there there, and that only a blockhead would mistake a parlor trick for the soul of the new machine. In the recent book "The AI Con," the linguist Emily Bender and the sociologist Alex Hanna belittle L.L.M.s as "mathy maths," "stochastic parrots," and "a racist pile of linear algebra."
But, Pavlick writes, "there is another way to react." It is O.K., she offers, "to not know." What Pavlick means, on the most basic level, is that large language models are black boxes. We don't really understand how they work. We don't know if it makes sense to call them intelligent, or if it will ever make sense to call them conscious. The existence of talking machines--entities that can do many of the things that only we have ever been able to do--throws a lot of other things into question. We refer to our own minds as if they weren't also black boxes.
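The "words in, numbers through, words out" pipeline the article describes can be caricatured in a few lines. Everything below is invented for illustration: the vocabulary, the tiny score table, and the function names are toys, and a real model replaces the hand-written table with billions of learned parameters.

```python
# A toy illustration of the "pile of numbers" view of a language model:
# words become numbers, the numbers pass through some arithmetic, and
# the result becomes a word again. All values here are made up.

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# A tiny invented "weight table": scores[i][j] is how strongly
# word i predicts word j as the next word.
scores = [
    [0.0, 0.9, 0.1, 0.0, 0.0],  # after "the": probably "cat"
    [0.0, 0.0, 0.9, 0.1, 0.0],  # after "cat": probably "sat"
    [0.1, 0.0, 0.0, 0.9, 0.0],  # after "sat": probably "on"
    [0.9, 0.0, 0.0, 0.0, 0.1],  # after "on": probably "the"
    [0.2, 0.0, 0.0, 0.0, 0.0],  # after "mat": weakly "the"
]

def predict_next(word: str) -> str:
    """Word in -> number -> arithmetic -> number -> word out."""
    i = word_to_id[word]                            # word to number
    row = scores[i]                                 # numbers through
    j = max(range(len(row)), key=row.__getitem__)   # pick the best
    return vocab[j]                                 # number to word

print(predict_next("the"))  # cat
```

The point of the caricature is the article's point: nothing in this loop is anything other than arithmetic on small numbers, yet scaled up enormously it produces fluent text.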


Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

Samuel, David, Øvrelid, Lilja, Velldal, Erik, Kutuzov, Andrey

arXiv.org Artificial Intelligence

We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.


PPTArena: A Benchmark for Agentic PowerPoint Editing

Ofengenden, Michael, Man, Yunze, Pang, Ziqi, Wang, Yu-Xiong

arXiv.org Artificial Intelligence

We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2,125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
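The plan-edit-check loop the abstract attributes to PPTPilot can be sketched generically. This is not the paper's implementation: the deck-as-dictionary representation, the "replace OLD with NEW" instruction grammar, and all function names are our own simplifications; the real agent routes between programmatic tools and XML operations on actual PowerPoint files.

```python
# Hypothetical sketch of an iterative plan-edit-check loop.
# A "deck" here is just a dict mapping slide IDs to their text.

def plan(instruction: str):
    """Toy planner: parse 'replace OLD with NEW' into edit operations."""
    _, old, _, new = instruction.split()
    return [("replace", old, new)]

def apply_edits(deck: dict, edits) -> dict:
    """Apply each edit operation to every slide."""
    new_deck = dict(deck)
    for _op, old, new in edits:
        for slide, text in new_deck.items():
            new_deck[slide] = text.replace(old, new)
    return new_deck

def check(deck: dict, instruction: str) -> bool:
    """Verify the task-specific constraint: no stale text remains."""
    _, old, _, _ = instruction.split()
    return all(old not in text for text in deck.values())

def edit_loop(deck: dict, instruction: str, max_iters: int = 3) -> dict:
    """Plan, edit, and check; retry until the constraint holds."""
    for _ in range(max_iters):
        deck = apply_edits(deck, plan(instruction))
        if check(deck, instruction):
            break
    return deck

deck = {"slide1": "Q1 revenue", "slide2": "Q1 costs"}
print(edit_loop(deck, "replace Q1 with Q2"))
```

The verification step is what distinguishes this style of agent from one-shot generation: a failed check sends the agent back to planning rather than returning a wrong deck.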


EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory

Fan, Wenzhe, Yan, Ning, Mortazavi, Masood

arXiv.org Artificial Intelligence

Planning has been a cornerstone of artificial intelligence for solving complex problems, and recent progress in LLM-based multi-agent frameworks has begun to extend this capability. However, the role of human-like memory within these frameworks remains largely unexplored. Understanding how agents coordinate through memory is critical for natural language planning, where iterative reasoning, constraint tracking, and error correction drive success. Inspired by the working memory model in cognitive psychology, we present EvoMem, a multi-agent framework built on a dual-evolving memory mechanism. The framework consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem), which evolves across queries by storing task-specific rules and constraints while remaining fixed within a query, and Query-feedback Memory (QMem), which evolves within a query by accumulating feedback across iterations for solution refinement. Both memory modules are reset at the end of each query session. Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, highlighting the effectiveness of EvoMem. This success underscores the importance of memory in enhancing EvoMem's multi-agent planning.
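The lifecycle of the two memory modules, as the abstract describes it, can be sketched in a few lines. The class and method names below are our own; the actual framework couples these memories to the Constraint Extractor, Verifier, and Actor agents rather than manipulating them directly.

```python
# A minimal sketch of EvoMem's dual memory, per the abstract's description.

class EvoMemory:
    def __init__(self):
        self.cmem = []  # Constraint Memory: fixed within a query
        self.qmem = []  # Query-feedback Memory: grows within a query

    def start_query(self, constraints):
        # CMem is populated once per query from extracted constraints
        # and stays fixed while that query is being solved.
        self.cmem = list(constraints)
        self.qmem = []

    def add_feedback(self, feedback):
        # QMem evolves within a query, accumulating verifier feedback
        # across iterations so the actor can refine its solution.
        self.qmem.append(feedback)

    def end_session(self):
        # Both modules are reset at the end of a query session.
        self.cmem, self.qmem = [], []

mem = EvoMemory()
mem.start_query(["meeting must end by 5pm"])
mem.add_feedback("iteration 1: proposed slot overlaps lunch")
print(len(mem.cmem), len(mem.qmem))  # 1 1
```

The asymmetry is the design point: constraints are stable facts about the task, while feedback is transient evidence about the current solution attempt, so the two are stored and cleared on different schedules.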


Scores of UK parliamentarians join call to regulate most powerful AI systems

The Guardian

The campaign is demanding stricter controls on frontier systems, citing fears superintelligent AI could "compromise national and global security". More than 100 UK parliamentarians are calling on the government to introduce binding regulations on the most powerful AI systems as concern grows that ministers are moving too slowly to create safeguards in the face of lobbying from the technology industry. A former AI minister and defence secretary are part of a cross-party group of Westminster MPs, peers and elected members of the Scottish, Welsh and Northern Irish legislatures demanding stricter controls on frontier systems, citing fears superintelligent AI "would compromise national and global security". The push for tougher regulation is being coordinated by a nonprofit organisation called Control AI whose backers include the co-founder of Skype, Jaan Tallinn.


Enhancing SPARQL Query Rewriting for Complex Ontology Alignments

Ondo, Anicet Lepetit, Capus, Laurence, Bousso, Mamadou

arXiv.org Artificial Intelligence

SPARQL query rewriting is a fundamental mechanism for uniformly querying heterogeneous ontologies in the Linked Data Web. However, the complexity of ontology alignments, particularly rich correspondences (c:c), makes this process challenging. Existing approaches primarily focus on simple (s:s) and partially complex (s:c) alignments, thereby overlooking the challenges posed by more expressive alignments. Moreover, the intricate syntax of SPARQL presents a barrier for non-expert users seeking to fully exploit the knowledge encapsulated in ontologies. This article proposes an innovative approach for the automatic rewriting of SPARQL queries from a source ontology to a target ontology, based on a user's need expressed in natural language. It leverages the principles of equivalence transitivity as well as the advanced capabilities of large language models such as GPT-4. By integrating these elements, this approach stands out for its ability to efficiently handle complex alignments, particularly (c:c) correspondences, by fully exploiting their expressiveness. Additionally, it facilitates access to aligned ontologies for users unfamiliar with SPARQL, providing a flexible solution for querying heterogeneous data. In the Linked Data Web, aligned ontologies play a crucial role in facilitating interoperability between different data sources.
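The core idea of alignment-based rewriting can be illustrated with a deliberately simple term substitution. The prefixes, the alignment mapping, and the query below are all invented, and real (c:c) correspondences relate whole expressions across ontologies rather than single terms, which is exactly what makes the paper's setting hard.

```python
# Toy sketch of SPARQL rewriting via an ontology alignment: each source
# term is substituted with its target-ontology equivalent. This handles
# only trivial (s:s)-style correspondences; it is an illustration, not
# the paper's method.

alignment = {
    "src:Author": "tgt:Writer",    # hypothetical class equivalence
    "src:wrote":  "tgt:authored",  # hypothetical property equivalence
}

def rewrite(query: str, mapping: dict) -> str:
    """Replace every aligned source term with its target counterpart."""
    for src_term, tgt_term in mapping.items():
        query = query.replace(src_term, tgt_term)
    return query

q = "SELECT ?x WHERE { ?x a src:Author . ?x src:wrote ?b }"
print(rewrite(q, alignment))
```

A (c:c) correspondence would instead map a whole graph pattern on the source side to a different graph pattern on the target side, so simple string substitution no longer suffices; that expressiveness gap is what the proposed approach addresses.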


Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

Purason, Taido, Chizhov, Pavel, Yamshchikov, Ivan P., Fishel, Mark

arXiv.org Artificial Intelligence

Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
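The baseline extension approach the abstract critiques, training on new text and appending the tokens that do not overlap with the existing vocabulary, can be sketched as below. The frequency-based selection here is a naive stand-in for training a new tokenizer, and the function name is ours; the paper's continued-BPE method instead resumes the merge-learning process on the new data.

```python
# Sketch of naive vocabulary extension: add the most frequent new-domain
# tokens that are missing from the existing vocabulary, up to a budget.
# This is the baseline the paper argues leaves many added tokens
# unreachable or unused.

from collections import Counter

def extend_vocab(old_vocab: list, new_corpus_tokens: list, budget: int) -> list:
    """Append up to `budget` frequent tokens absent from old_vocab."""
    counts = Counter(new_corpus_tokens)
    added = []
    for token, _freq in counts.most_common():
        if token not in old_vocab:
            added.append(token)
        if len(added) == budget:
            break
    return old_vocab + added

print(extend_vocab(["the", "cat"], ["katt", "katt", "satt"], 1))
```

The weakness of this baseline is that appended tokens are bolted on outside the tokenizer's merge rules, so the segmenter may never actually produce them; continuing BPE training makes the new tokens reachable through learned merges.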


Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room

Hooshyar, Danial, Šír, Gustav, Yang, Yeongwook, Kikas, Eve, Hämäläinen, Raija, Kärkkäinen, Tommi, Gašević, Dragan, Azevedo, Roger

arXiv.org Artificial Intelligence

Despite significant advancements in AI-driven educational systems and ongoing calls for responsible AI for education, several critical issues remain unresolved -- acting as the elephant in the room within AI in education, learning analytics, educational data mining, learning sciences, and educational psychology communities. This critical analysis identifies and examines nine persistent challenges that continue to undermine the fairness, transparency, and effectiveness of current AI methods and applications in education. These include: (1) the lack of clarity around what AI for education truly means -- often ignoring the distinct purposes, strengths, and limitations of different AI families -- and the trend of equating it with domain-agnostic, company-driven large language models; (2) the widespread neglect of essential learning processes such as motivation, emotion, and (meta)cognition in AI-driven learner modelling and their contextual nature; (3) limited integration of domain knowledge and lack of stakeholder involvement in AI design and development; (4) continued use of non-sequential machine learning models on temporal educational data; (5) misuse of non-sequential metrics to evaluate sequential models; (6) use of unreliable explainable AI methods to provide explanations for black-box models; (7) ignoring ethical guidelines in addressing data inconsistencies during model training; (8) use of mainstream AI methods for pattern discovery and learning analytics without systematic benchmarking; and (9) overemphasis on global prescriptions while overlooking localised, student-specific recommendations. Supported by theoretical and empirical research, we demonstrate how hybrid AI methods -- specifically neural-symbolic AI -- can address the elephant in the room and serve as the foundation for responsible, trustworthy AI systems in education.


TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs

Başar, Ezgi, Padovani, Francesca, Jumelet, Jaap, Bisazza, Arianna

arXiv.org Artificial Intelligence

We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
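Minimal-pair evaluation of this kind is typically scored by checking whether the model assigns higher probability to the grammatical sentence of each pair. The sketch below shows that scoring scheme with a stand-in scorer; the `toy_logprob` function is invented (it just prefers shorter strings), and the Turkish pair is our own illustrative example, not an item from TurBLiMP.

```python
# Minimal-pair accuracy, BLiMP-style: a model is correct on a pair when
# it assigns a higher (log-)probability to the grammatical sentence.
# In practice the scorer sums token log-probabilities from an LM.

def minimal_pair_accuracy(pairs, logprob) -> float:
    """Fraction of pairs where the grammatical sentence scores higher."""
    correct = sum(1 for good, bad in pairs if logprob(good) > logprob(bad))
    return correct / len(pairs)

# Toy stand-in scorer: pretend shorter sentences are more probable.
toy_logprob = lambda sentence: -len(sentence)

pairs = [
    # (grammatical, ungrammatical): invented agreement contrast
    ("kedi uyudu", "kedi uyudular"),
]
print(minimal_pair_accuracy(pairs, toy_logprob))  # 1.0
```

Because each item is a pair differing in a single phenomenon, accuracy on a phenomenon isolates the model's sensitivity to that one grammatical contrast, which is what lets the benchmark compare word-order and morphological effects separately.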